Today, the video game industry is one of the largest segments of the entertainment industry. Its capitalization exceeded USD 190 billion in 2021, growing by more than 9% per year, and it is expected to reach USD 314 billion by 2026. The industry keeps growing every year, but it was not always so.
In the mid-1970s, gaming was a small community of enthusiasts. Over the years that community grew, and profits grew with it. Starting from low-profit projects, total annual profit rose over several decades to USD 9.5 billion, and in 2021 the figure was USD 60 billion. Every year a large number of new projects come out, both from large, influential companies and from indie developers. The variety of games attracts more and more people, and so does their availability: indie games run even on non-gaming computers, and arcade games are always at hand to kill time on the road. Over time, gaming has become ubiquitous.
This project analyzes video game sales from 1980 to 2020 across North America, Europe, Japan and the rest of the world.
This dataset was scraped from https://www.vgchartz.com/. It lists video games released from 1980 to 2020 that sold more than 100,000 copies, 16,598 records in total. Below are the attributes we will use in our analysis:
For this project, the analysis and visualization consist of 4 parts:
To make the data easier and cleaner to work with, we will apply a few manipulations:
# Importing necessary libraries
import pandas as pd
import numpy as np
# Reading our data frame from csv file
df = pd.read_csv('vgsales.csv')
df.head(10)
| | Rank | Name | Platform | Year | Genre | Publisher | NA_Sales | EU_Sales | JP_Sales | Other_Sales | Global_Sales |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Wii Sports | Wii | 2006.0 | Sports | Nintendo | 41.49 | 29.02 | 3.77 | 8.46 | 82.74 |
| 1 | 2 | Super Mario Bros. | NES | 1985.0 | Platform | Nintendo | 29.08 | 3.58 | 6.81 | 0.77 | 40.24 |
| 2 | 3 | Mario Kart Wii | Wii | 2008.0 | Racing | Nintendo | 15.85 | 12.88 | 3.79 | 3.31 | 35.82 |
| 3 | 4 | Wii Sports Resort | Wii | 2009.0 | Sports | Nintendo | 15.75 | 11.01 | 3.28 | 2.96 | 33.00 |
| 4 | 5 | Pokemon Red/Pokemon Blue | GB | 1996.0 | Role-Playing | Nintendo | 11.27 | 8.89 | 10.22 | 1.00 | 31.37 |
| 5 | 6 | Tetris | GB | 1989.0 | Puzzle | Nintendo | 23.20 | 2.26 | 4.22 | 0.58 | 30.26 |
| 6 | 7 | New Super Mario Bros. | DS | 2006.0 | Platform | Nintendo | 11.38 | 9.23 | 6.50 | 2.90 | 30.01 |
| 7 | 8 | Wii Play | Wii | 2006.0 | Misc | Nintendo | 14.03 | 9.20 | 2.93 | 2.85 | 29.02 |
| 8 | 9 | New Super Mario Bros. Wii | Wii | 2009.0 | Platform | Nintendo | 14.59 | 7.06 | 4.70 | 2.26 | 28.62 |
| 9 | 10 | Duck Hunt | NES | 1984.0 | Shooter | Nintendo | 26.93 | 0.63 | 0.28 | 0.47 | 28.31 |
# Shape of data frame
df.shape
(16598, 11)
# We won't use the Rank feature in this analysis, so we drop it
df = df.drop('Rank', axis = 1)
df.head()
| | Name | Platform | Year | Genre | Publisher | NA_Sales | EU_Sales | JP_Sales | Other_Sales | Global_Sales |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Wii Sports | Wii | 2006.0 | Sports | Nintendo | 41.49 | 29.02 | 3.77 | 8.46 | 82.74 |
| 1 | Super Mario Bros. | NES | 1985.0 | Platform | Nintendo | 29.08 | 3.58 | 6.81 | 0.77 | 40.24 |
| 2 | Mario Kart Wii | Wii | 2008.0 | Racing | Nintendo | 15.85 | 12.88 | 3.79 | 3.31 | 35.82 |
| 3 | Wii Sports Resort | Wii | 2009.0 | Sports | Nintendo | 15.75 | 11.01 | 3.28 | 2.96 | 33.00 |
| 4 | Pokemon Red/Pokemon Blue | GB | 1996.0 | Role-Playing | Nintendo | 11.27 | 8.89 | 10.22 | 1.00 | 31.37 |
# Rename the sales columns so they look better in visualizations
df.columns = df.columns.str.replace('_', ' ')
df.head()
| | Name | Platform | Year | Genre | Publisher | NA Sales | EU Sales | JP Sales | Other Sales | Global Sales |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Wii Sports | Wii | 2006.0 | Sports | Nintendo | 41.49 | 29.02 | 3.77 | 8.46 | 82.74 |
| 1 | Super Mario Bros. | NES | 1985.0 | Platform | Nintendo | 29.08 | 3.58 | 6.81 | 0.77 | 40.24 |
| 2 | Mario Kart Wii | Wii | 2008.0 | Racing | Nintendo | 15.85 | 12.88 | 3.79 | 3.31 | 35.82 |
| 3 | Wii Sports Resort | Wii | 2009.0 | Sports | Nintendo | 15.75 | 11.01 | 3.28 | 2.96 | 33.00 |
| 4 | Pokemon Red/Pokemon Blue | GB | 1996.0 | Role-Playing | Nintendo | 11.27 | 8.89 | 10.22 | 1.00 | 31.37 |
# Check for missing values
# As we can see, there are missing values in the Year and Publisher columns
# Too few rows are affected to have an impact on the statistics, so we will delete them
df.isna().sum()
Name 0 Platform 0 Year 271 Genre 0 Publisher 58 NA Sales 0 EU Sales 0 JP Sales 0 Other Sales 0 Global Sales 0 dtype: int64
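To back the claim that the missing rows are too few to matter, one could check what share of rows `dropna()` would remove. The snippet below is only a minimal sketch on a made-up frame: the `toy` data and the `missing_share` helper are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd


def missing_share(frame: pd.DataFrame) -> float:
    """Return the fraction of rows that contain at least one missing value."""
    return float(frame.isna().any(axis=1).mean())


# Hypothetical toy data standing in for the real data frame
toy = pd.DataFrame({
    'Year': [2006.0, np.nan, 2008.0, 2009.0],
    'Publisher': ['Nintendo', 'Nintendo', None, 'Nintendo'],
})

share = missing_share(toy)
print(f'{share:.1%} of rows would be dropped by dropna()')
```

On the real data, the counts reported above (271 missing years and 58 missing publishers out of 16,598 rows) mean that at most about 2% of rows are affected.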
# Delete rows with null values
df = df.dropna()
# Changing Year column data type to int for better visualization
df['Year'] = df['Year'].astype(int)
df.head()
| | Name | Platform | Year | Genre | Publisher | NA Sales | EU Sales | JP Sales | Other Sales | Global Sales |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Wii Sports | Wii | 2006 | Sports | Nintendo | 41.49 | 29.02 | 3.77 | 8.46 | 82.74 |
| 1 | Super Mario Bros. | NES | 1985 | Platform | Nintendo | 29.08 | 3.58 | 6.81 | 0.77 | 40.24 |
| 2 | Mario Kart Wii | Wii | 2008 | Racing | Nintendo | 15.85 | 12.88 | 3.79 | 3.31 | 35.82 |
| 3 | Wii Sports Resort | Wii | 2009 | Sports | Nintendo | 15.75 | 11.01 | 3.28 | 2.96 | 33.00 |
| 4 | Pokemon Red/Pokemon Blue | GB | 1996 | Role-Playing | Nintendo | 11.27 | 8.89 | 10.22 | 1.00 | 31.37 |
# As we can see, there is too little data for 2016, 2017 and 2020, so we will delete these years
# numeric_only = True keeps the text columns out of the sum on recent pandas versions
df.groupby('Year').sum(numeric_only = True).tail()
| Year | NA Sales | EU Sales | JP Sales | Other Sales | Global Sales |
|---|---|---|---|---|---|
| 2014 | 131.97 | 125.63 | 39.46 | 40.02 | 337.03 |
| 2015 | 102.82 | 97.71 | 33.72 | 30.01 | 264.44 |
| 2016 | 22.66 | 26.76 | 13.67 | 7.75 | 70.90 |
| 2017 | 0.00 | 0.00 | 0.05 | 0.00 | 0.05 |
| 2020 | 0.27 | 0.00 | 0.00 | 0.02 | 0.29 |
# Keep all years except 2016, 2017 and 2020
df = df[~df['Year'].isin([2016, 2017, 2020])]
df.groupby('Year').sum(numeric_only = True).tail()
| Year | NA Sales | EU Sales | JP Sales | Other Sales | Global Sales |
|---|---|---|---|---|---|
| 2011 | 241.00 | 167.31 | 53.04 | 54.39 | 515.80 |
| 2012 | 154.93 | 118.76 | 51.74 | 37.82 | 363.49 |
| 2013 | 154.77 | 125.80 | 47.59 | 39.82 | 368.11 |
| 2014 | 131.97 | 125.63 | 39.46 | 40.02 | 337.03 |
| 2015 | 102.82 | 97.71 | 33.72 | 30.01 | 264.44 |
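The sparse years were picked by eye from the yearly totals. As an alternative, the same choice could be expressed as a rule. The sketch below reuses the global totals for the last five years in the data and flags years below a cut-off. The 100 million threshold is an assumption made for illustration, not a value taken from the original analysis.

```python
import pandas as pd

# Global sales per year (in millions), copied from the yearly totals before filtering
yearly_global = pd.Series(
    {2014: 337.03, 2015: 264.44, 2016: 70.90, 2017: 0.05, 2020: 0.29},
    name='Global Sales',
)

# Hypothetical cut-off: treat recent years below 100 million copies as incomplete
THRESHOLD = 100.0
sparse_years = yearly_global[yearly_global < THRESHOLD].index.tolist()
print(sparse_years)
```

Applied to the whole series, such a rule would also flag the early 1980s, when the market was genuinely small, so it only makes sense for the most recent years.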
Before we start, we should import all the libraries we need.
# For data visualization we will use matplotlib and plotly libraries
import matplotlib.pyplot as plt
import plotly.graph_objects as go
import plotly.express as px
from ipywidgets import interact
# Define style and size of our plots.
plt.style.use('ggplot')
plt.rc('figure', figsize = (16, 9))
plt.rc('font', size = 14)
# Preparing data frame for visualization
game_prevalance = df.groupby('Year').sum(numeric_only = True).drop('Global Sales', axis = 1)
game_prevalance.head()
| Year | NA Sales | EU Sales | JP Sales | Other Sales |
|---|---|---|---|---|
| 1980 | 10.59 | 0.67 | 0.00 | 0.12 |
| 1981 | 33.40 | 1.96 | 0.00 | 0.32 |
| 1982 | 26.92 | 1.65 | 0.00 | 0.31 |
| 1983 | 7.76 | 0.80 | 8.10 | 0.14 |
| 1984 | 33.28 | 2.10 | 14.27 | 0.70 |
# Plot one line per region in a loop
for i in game_prevalance.columns:
    plt.plot(game_prevalance[i], label = i.split()[0], linewidth = 3.0)
plt.title('Figure 1.1 Game Prevalence over the years', fontsize = 18)
plt.xlabel('Year', fontsize = 16)
plt.ylabel('Sales (in millions)', fontsize = 16)
plt.legend(fontsize = 16, loc = 'upper left')
plt.show()
As Figure 1.1 shows, North America holds the largest share of game sales throughout the whole period, and the earliest recorded sales also come from this region. The first significant jump for NA came in 1995, followed around 2000 by a massive increase in both NA and EU, which suggests that the two regions follow a similar trend. Japan and the rest of the world behave differently: after 2005 the curve for the other regions starts to resemble NA and EU, while JP follows its own trend, which we will see later in the analysis.
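One rough way to put a number on how similar the regional trends are is a Pearson correlation between the regional series. This is not part of the original analysis. The sketch below uses the five-year regional totals from the table that accompanies Figure 1.2, so it has only eight points per region and should be read as an illustration rather than a statistical result.

```python
import pandas as pd

# Five-year regional totals (in millions), copied from the table for Figure 1.2
totals = pd.DataFrame(
    {
        'NA': [111.95, 123.71, 115.36, 460.75, 897.05, 1506.17, 986.91, 102.82],
        'EU': [7.18, 24.02, 42.82, 240.05, 467.11, 786.97, 714.07, 97.71],
        'JP': [22.37, 80.12, 117.89, 254.44, 200.24, 310.45, 251.32, 33.72],
        'Other': [1.59, 5.54, 6.88, 40.54, 134.91, 329.72, 231.95, 30.01],
    },
    index=pd.Index(range(1980, 2020, 5), name='Period start'),
)

# Pairwise Pearson correlation between the regional series
corr = totals.corr()
print(corr.round(2))
```

A higher value for the NA/EU pair than for the NA/JP pair would be consistent with the reading of Figure 1.1, although every region shares the overall rise and fall of the market, so all pairs are expected to correlate to some degree.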
Now we will look at a bar plot representation of game prevalence:
# Preparing data frame for visualization
prevalance_seg = df.groupby((df['Year'] // 5) * 5).sum(numeric_only = True).drop(columns = ['Year', 'Global Sales'])
prevalance_seg
| Year | NA Sales | EU Sales | JP Sales | Other Sales |
|---|---|---|---|---|
| 1980 | 111.95 | 7.18 | 22.37 | 1.59 |
| 1985 | 123.71 | 24.02 | 80.12 | 5.54 |
| 1990 | 115.36 | 42.82 | 117.89 | 6.88 |
| 1995 | 460.75 | 240.05 | 254.44 | 40.54 |
| 2000 | 897.05 | 467.11 | 200.24 | 134.91 |
| 2005 | 1506.17 | 786.97 | 310.45 | 329.72 |
| 2010 | 986.91 | 714.07 | 251.32 | 231.95 |
| 2015 | 102.82 | 97.71 | 33.72 | 30.01 |
# Creating bar plot
fig = px.bar(prevalance_seg, x = prevalance_seg.index, y = prevalance_seg.columns,
             labels = {
                 'value' : 'Sales (in millions)',
                 'variable' : 'Region'
             },
             title = 'Figure 1.2 Game Prevalence every 5 years')
fig['layout']['font'] = dict(size=13)
fig.show()
For this graph we grouped the Year feature into five-year periods, each labelled by its first year; the last bar covers only 2015, because later years were removed during cleaning. The result is shown in Figure 1.2. The bar chart makes it clear that the earliest sales came not only from NA but also from JP, although in smaller numbers. In the following five years Japanese sales approached those of NA, and the European market also began to grow. As in Figure 1.1, a significant change is visible from 1995 to 2000. Sales peaked between 2005 and 2010 and then dropped significantly between 2010 and 2015.
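The grouping into five-year periods relies on integer division: `(Year // 5) * 5` maps every year to the first year of its period. A minimal sketch on a few made-up years (the `years` sample and the `period_label` column are illustrative and do not appear in the original notebook):

```python
import pandas as pd

# Hypothetical sample of release years
years = pd.Series([1980, 1984, 1985, 1999, 2000, 2014, 2015], name='Year')

# Floor-divide by 5 and multiply back to get the first year of each period
period_start = (years // 5) * 5

# Optional human-readable labels such as "1980-1984"
period_label = period_start.astype(str) + '-' + (period_start + 4).astype(str)

print(pd.DataFrame({'Year': years, 'Period': period_label}))
```

Grouping by `period_start` instead of by `Year` is what turns the yearly totals into the five-year bars of Figure 1.2.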
# For this visualisation we only need the year and the number of games released in it
released_games = df.groupby('Year').count().iloc[:, :1]
released_games = released_games.rename(columns = {'Name':'Released'})
released_games.head(10)
| Year | Released |
|---|---|
| 1980 | 9 |
| 1981 | 46 |
| 1982 | 36 |
| 1983 | 17 |
| 1984 | 14 |
| 1985 | 14 |
| 1986 | 21 |
| 1987 | 16 |
| 1988 | 15 |
| 1989 | 17 |
# For visualization we will create a bar chart
plt.bar(released_games.index, released_games['Released'], color = '#2dd4ed', edgecolor = 'black')
plt.xlabel('Year', fontsize = 16)
plt.ylabel('Released games', fontsize = 16)
plt.title('Figure 2. Number of games released every year', fontsize = 18)
plt.show();
Figure 2 shows that, as the game industry grew, companies released more games. The first increase in releases started in 1995, followed by another in 2003, with the peak in 2008 and 2009. Comparing this graph (Figure 2) with the game prevalence graph (Figure 1.1) shows that they follow a similar trend.
# Preparing data frame for visualization
genres_in_region = df.groupby('Genre').sum(numeric_only = True).drop(columns = ['Year', 'Global Sales'])
genres_in_region
| Genre | NA Sales | EU Sales | JP Sales | Other Sales |
|---|---|---|---|---|
| Action | 855.90 | 510.12 | 152.86 | 183.09 |
| Adventure | 101.59 | 63.35 | 51.04 | 16.59 |
| Fighting | 219.14 | 98.85 | 86.51 | 35.73 |
| Misc | 396.70 | 211.68 | 105.86 | 73.89 |
| Platform | 445.20 | 199.78 | 130.54 | 51.20 |
| Puzzle | 122.01 | 50.52 | 56.68 | 12.47 |
| Racing | 356.60 | 235.17 | 56.60 | 76.49 |
| Role-Playing | 325.11 | 186.28 | 346.62 | 58.94 |
| Shooter | 567.72 | 302.75 | 37.57 | 99.48 |
| Simulation | 181.51 | 112.93 | 63.24 | 31.34 |
| Sports | 665.52 | 363.98 | 133.98 | 130.73 |
| Strategy | 67.72 | 44.52 | 49.05 | 11.19 |
def plot_genres(Region):
    # Preparing data frame for specific region
    genre_df = genres_in_region.sort_values(Region, ascending = True)
    reg = Region.split()[0]
    # Setting colors for different regions
    if reg == 'NA':
        color = '#4446cf'
    elif reg == 'EU':
        color = '#8a44cf'
    elif reg == 'JP':
        color = '#cf446e'
    else:
        color = '#448ccf'
    # Creating bar plot sorted by amount of sales
    plt.barh(genre_df.index,
             genre_df[Region],
             color = color, edgecolor = 'black')
    # Setting label and title
    plt.xlabel('Sales (in millions)', fontsize = 16)
    plt.title(f'Figure 3. Most popular genres in {reg}', fontsize = 18)
    plt.show()
# A plain list of column names gives a dropdown with one entry per region
interact(plot_genres, Region = list(genres_in_region.columns));
In Figure 3 we visualized our 4 regions to see which genres gamers in each region prefer. Comparing the regions shows that NA, EU and the rest of the world are highly similar, while JP follows a completely different trend. The top 3 genres for NA, EU and the rest of the world are Action, Sports and Shooter. The remaining genres rank differently from region to region, but the general trend is the same.
In Japan the most popular genre is Role-Playing (RPG), and its dominance of the market is clearly visible: RPG sales in JP are more than twice those of the second most popular genre. The least popular genre in this region is Shooter, even though it is a top-3 genre in all other regions. One possible explanation is that Japanese gaming culture differs considerably, and so do the games players there prefer.
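The rankings read off Figure 3 can also be checked numerically. The sketch below rebuilds the genre totals from the table shown earlier in this section and derives the top 3 genres per region, as well as the ratio between the first and second genre in Japan. The names `genres`, `top3` and `jp_ratio` are introduced here for illustration only.

```python
import pandas as pd

# Regional totals per genre (in millions), copied from the genre table above
genres = pd.DataFrame(
    {
        'NA Sales': [855.90, 101.59, 219.14, 396.70, 445.20, 122.01,
                     356.60, 325.11, 567.72, 181.51, 665.52, 67.72],
        'EU Sales': [510.12, 63.35, 98.85, 211.68, 199.78, 50.52,
                     235.17, 186.28, 302.75, 112.93, 363.98, 44.52],
        'JP Sales': [152.86, 51.04, 86.51, 105.86, 130.54, 56.68,
                     56.60, 346.62, 37.57, 63.24, 133.98, 49.05],
        'Other Sales': [183.09, 16.59, 35.73, 73.89, 51.20, 12.47,
                        76.49, 58.94, 99.48, 31.34, 130.73, 11.19],
    },
    index=['Action', 'Adventure', 'Fighting', 'Misc', 'Platform', 'Puzzle',
           'Racing', 'Role-Playing', 'Shooter', 'Simulation', 'Sports', 'Strategy'],
)

# Top 3 genres for every region
top3 = {region: genres[region].nlargest(3).index.tolist() for region in genres.columns}

# Ratio between the best-selling and the second best-selling genre in Japan
jp_sorted = genres['JP Sales'].sort_values(ascending=False)
jp_ratio = jp_sorted.iloc[0] / jp_sorted.iloc[1]

for region, names in top3.items():
    print(region, names)
print(f'JP top genre sells {jp_ratio:.2f}x the second one')
```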
# Data frame we will use in this visualisation
pr = df.groupby('Publisher').sum(numeric_only = True).drop('Year', axis = 1).sort_values('Global Sales', ascending = False)
pr.head(20)
| Publisher | NA Sales | EU Sales | JP Sales | Other Sales | Global Sales |
|---|---|---|---|---|---|
| Nintendo | 814.59 | 417.37 | 453.82 | 94.99 | 1780.96 |
| Electronic Arts | 580.58 | 360.47 | 13.89 | 126.00 | 1081.14 |
| Activision | 424.34 | 212.33 | 6.40 | 74.35 | 717.74 |
| Sony Computer Entertainment | 262.79 | 184.68 | 73.88 | 79.35 | 600.72 |
| Ubisoft | 248.69 | 158.65 | 7.09 | 48.86 | 463.49 |
| Take-Two Interactive | 218.64 | 117.33 | 5.81 | 54.76 | 396.41 |
| THQ | 208.60 | 94.60 | 5.01 | 32.11 | 340.44 |
| Konami Digital Entertainment | 88.90 | 68.14 | 90.30 | 29.84 | 277.35 |
| Sega | 108.65 | 81.27 | 55.70 | 24.27 | 269.89 |
| Namco Bandai Games | 67.38 | 41.03 | 124.50 | 14.06 | 247.16 |
| Microsoft Game Studios | 154.69 | 68.10 | 3.11 | 18.45 | 244.37 |
| Capcom | 77.62 | 38.65 | 66.85 | 14.58 | 197.83 |
| Atari | 101.23 | 25.78 | 10.70 | 8.73 | 146.75 |
| Square Enix | 48.12 | 31.90 | 47.56 | 13.67 | 141.20 |
| Warner Bros. Interactive Entertainment | 73.97 | 47.15 | 1.02 | 16.69 | 138.82 |
| Disney Interactive Studios | 70.44 | 34.36 | 0.56 | 13.15 | 118.76 |
| Eidos Interactive | 47.85 | 34.85 | 6.11 | 7.90 | 96.75 |
| LucasArts | 48.43 | 26.00 | 0.20 | 10.28 | 84.95 |
| Bethesda Softworks | 38.49 | 29.49 | 1.42 | 9.81 | 79.28 |
| Midway Games | 45.10 | 18.22 | 0.12 | 5.69 | 69.29 |
def plot_publishers(Region):
    # Preparing data frame for specific region
    pub_in_region = df.groupby('Publisher').sum(numeric_only = True).drop('Year', axis = 1).sort_values(Region, ascending = False).head(20)
    # Creating the plot
    fig = px.bar(pub_in_region, x = pub_in_region.index, y = Region,
                 labels = {
                     Region : 'Sales (in millions)'
                 },
                 title = f'Figure 4. Top 20 Game Publishers of all time in {Region.split()[0]}')
    fig['layout']['font'] = dict(size=13)
    fig.show()
# A plain list of column names gives a dropdown with the 4 regions plus Global
interact(plot_publishers, Region = list(pr.columns));
This bar plot shows the top 20 game publishers of all time by copies sold in the 4 regions; global statistics can also be selected. From Figure 4 we can say that the most popular publisher is clearly the Japanese company Nintendo. The trend is almost the same in all regions except Japan, where the favorite publishers are once again completely different: the top 7 publishers in the JP chart are Japanese companies. While Nintendo's position is unsurprising, the other companies suggest that Japanese publishers are more likely to release games that sell mostly in the JP region rather than elsewhere.
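How strongly a publisher depends on its home market can be illustrated by the share of its global sales that comes from Japan. The sketch below uses a handful of rows from the publisher table above. The selection of publishers and the `jp_share` name are choices made for this example, not part of the original analysis.

```python
import pandas as pd

# JP and global sales (in millions) for selected publishers, copied from the table above
publishers = pd.DataFrame(
    {
        'JP Sales': [453.82, 13.89, 6.40, 90.30, 124.50, 66.85, 47.56],
        'Global Sales': [1780.96, 1081.14, 717.74, 277.35, 247.16, 197.83, 141.20],
    },
    index=['Nintendo', 'Electronic Arts', 'Activision',
           'Konami Digital Entertainment', 'Namco Bandai Games',
           'Capcom', 'Square Enix'],
)

# Fraction of each publisher's global sales that was made in Japan
jp_share = (publishers['JP Sales'] / publishers['Global Sales']).sort_values(ascending=False)
print(jp_share.round(3))
```

For the Japanese publishers in this sample, Japan accounts for roughly a quarter to a half of global sales, while for Electronic Arts and Activision it is below 2%.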
Based on the analysis we made, we can conclude several things:
Summing up these points, we can say that the general trend in games is more or less the same for all regions except Japan. We assume that cultural differences between East and West also lead to differences in the games players prefer. A Japanese gamer is more likely to buy a game released by a Japanese publisher. According to our analysis of the most popular genres by region, Japanese players mostly prefer RPGs. Because of this, Japanese publishers are more interested in releasing RPGs to satisfy their local market, which is their main one.